Introduction: Centralized Logging

Get an introduction to centralized logging.

Overview#

We used to store logs in files. That wasn’t so bad when we only had a few applications and they were running on dedicated servers. Finding logs was easy given the nature of static infrastructure. Locating issues was easy as well when there was only an application or two. But times have changed. Infrastructure grew in size. Now we have tens, hundreds, or even thousands of nodes. The number of applications has increased as well. More importantly, everything has become dynamic. Nodes are being created and destroyed. We started scaling them up and down. We began using schedulers, and that means our applications started “floating” inside our clusters.

Everything became dynamic and volatile. Everything increased in size. As a result, hunting for logs and going through endless entries to find issues became unacceptable. We needed a better way.

Today, the solution is centralized logging. As our systems became distributed, we had to centralize the location where we store logs. As a result, we got “log collectors” that gather logs from all the parts of the systems and ship them to a central database. Centralized logging became the de facto standard and a must in any modern infrastructure.

There are many managed third-party solutions like Datadog, Splunk, and others. Cloud providers offer their own services as well, like Google Cloud Operations (formerly known as Stackdriver) and AWS CloudWatch; almost every cloud provider has a solution.

When it comes to self-hosted options, the ELK stack (Elasticsearch, Logstash, and Kibana) became the de facto standard. Elasticsearch serves as the database, Logstash transfers logs, and Kibana queries and visualizes them.

We’re about to explore self-managed logging and leave managed logging aside. Since we’ve already stated that the ELK stack is the de facto standard, it might seem like the obvious choice. But it’s not. We’ll go down a different route.

Why Not Use The ELK Stack?#

The problem with the ELK stack lies in its ad hoc nature. Elasticsearch is a database built around Lucene, a full-text search engine designed for fuzzy search across vast amounts of human-written text. It handles complex tasks, resulting in a CPU- and memory-intensive solution. In particular, it creates a sizeable in-memory index that eats resources for breakfast. Elasticsearch can quickly become the most demanding application in a system. That’s especially true if we introduce clustering, sharding, and other scaling options.

What we’re really trying to say is that the ELK stack is overqualified for the task. It provides a full-text analytics platform at a high resource and administrative cost, while the essential logging requirement is just a distributed grep. There was increasing demand for such tools, and there were several attempts at creating distributed “grep-style” log aggregation platforms. Many failed because they lacked their own Kibana: even when aggregation and storage were done right, querying and visualization were lackluster or completely absent.

The ELK stack is demanding on resources and hard to manage. It might be overkill if our needs focus on logs and not all the other tasks it can perform. The problem is that, within self-hosted solutions, there wasn’t much choice until recently.

Using Loki to Store and Query Logs#

Grafana Labs announced the Loki project at the end of 2018. It describes itself as “Like Prometheus, but for logs.” Given that Prometheus is the most promising tool for storing metrics, at least among open-source projects, that description certainly sparks interest.

The idea is to abandon text indexing and instead attach labels to log lines in a way similar to metric labels in Prometheus. In fact, it’s using precisely the same Prometheus labeling library. It might not be apparent why this is a brilliant idea, so let’s elaborate a bit.
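To make the idea concrete, here is a minimal sketch; the label names are illustrative, not taken from any specific setup:

```logql
# A Prometheus series is identified by a metric name plus a label set:
http_requests_total{app="api", env="prod"}

# A Loki log stream is identified by the same kind of label set.
# Only these labels are indexed; the log lines themselves are not.
{app="api", env="prod"}
```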

Indexing a massive amount of data is compute-intensive and tends to create issues when running at scale. By labeling log streams instead of indexing the full log text, Loki keeps itself compute-efficient. But that’s not the only benefit.

Having the same set of labels on application logs and metrics helps immensely in correlating the two during investigations. On top of that, logs are queried in the same style as metrics stored in Prometheus. To be more precise, Loki uses LogQL, which is a subset of PromQL. Given that querying metrics tends to require a richer query language, the decision to use a subset makes sense. On top of all that, if we have adopted Prometheus, we don’t need to learn yet another query language. Even if Prometheus is not our choice, its metrics format and query language are being adopted by other tools, so learning them is an excellent investment.
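As a sketch of that querying model (the labels and the search string are hypothetical), a LogQL query starts with a Prometheus-style stream selector and can then filter lines or aggregate them with PromQL-style functions:

```logql
# Distributed grep: select streams by label, then filter lines by content.
{app="api", env="prod"} |= "error"

# PromQL-style aggregation: per-app rate of matching lines over 5 minutes.
sum by (app) (rate({env="prod"} |= "error" [5m]))
```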

The UI for exploring logs is based on Grafana, which happens to be the de facto standard in the observability world. Grafana 6.0 added the “Explore” screen, with support for Loki included from day one. It allows us to query and correlate both logs from Loki and metrics from Prometheus in the same view.
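As a sketch of how that wiring typically looks (the file name, service name, and URL are assumptions, not from this lesson), Loki can be registered as a Grafana data source through a provisioning file:

```yaml
# Hypothetical provisioning file, e.g. provisioning/datasources/loki.yaml.
# The URL assumes Loki is reachable as a service named "loki" on its
# default port, 3100.
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```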

As a result of those additions, we suddenly went from having no good self-hosted log-querying solution to having one that is arguably better than Kibana.
